The automatic identification of discourse units in Dutch text
Abstract
The identification of discourse units is an essential step in discourse parsing, the automatic construction of a discourse structure from a text. We present a rule-based algorithm to identify elementary discourse units (EDUs) in Dutch written text. Contrary to approaches that focus on the determination of segment boundaries, we identify complete discourse units, which is especially helpful for the recognition of interrupted EDUs that contain embedded discourse units. We use syntactic and lexical information to decompose sentences into EDUs. Experimental results show that our algorithm for EDU identification performs well on texts of various genres.
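The difference between marking segment boundaries and identifying complete units can be illustrated with a small sketch. This is not the authors' algorithm: it is a toy illustration that assumes the spans of embedded clauses are already known (for example from a syntactic parser), and the names `decompose` and `edu_text` are hypothetical. It shows how an interrupted EDU can be represented as one unit made of several token spans, rather than as two unrelated segments.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)
EDU = List[Span]        # one discourse unit; several spans mean it is interrupted


def decompose(n_tokens: int, embedded: List[Span]) -> List[EDU]:
    """Split a sentence into a host EDU plus embedded EDUs.

    Assumption (not from the paper): `embedded` holds non-overlapping,
    non-nested spans of embedded clauses supplied by some parser.
    """
    host: EDU = []
    cursor = 0
    for start, end in sorted(embedded):
        if not (cursor <= start < end <= n_tokens):
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        if start > cursor:
            # Tokens before the embedded clause belong to the host unit.
            host.append((cursor, start))
        cursor = end
    if cursor < n_tokens:
        # Tokens after the last embedded clause also belong to the host unit.
        host.append((cursor, n_tokens))
    edus: List[EDU] = [host] if host else []
    edus.extend([span] for span in sorted(embedded))
    return edus


def edu_text(tokens: List[str], edu: EDU) -> str:
    """Join the tokens of all spans of one EDU into a single string."""
    return " ".join(tok for start, end in edu for tok in tokens[start:end])


tokens = ["The", "algorithm", ",", "which", "uses", "syntactic", "cues", ",",
          "finds", "complete", "units", "."]
edus = decompose(len(tokens), [(2, 8)])
for edu in edus:
    print(edu, "->", edu_text(tokens, edu))
```

In this representation the host unit keeps both of its parts, spans (0, 2) and (8, 12), together as one EDU, whereas a purely boundary-based segmentation would yield three consecutive segments with no record that the first and third belong together.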
Similar Resources
Multi-Layer Discourse Annotation of a Dutch Text Corpus
We have compiled a corpus of 80 Dutch texts from expository and persuasive genres, which we annotated for rhetorical and genre-specific discourse structure, and lexical cohesion with the goal of creating a gold standard for further research. The annotations are based on a segmentation of the text in elementary discourse units that takes into account cues from syntax and punctuation. During the ...
A Corpus-based Study of Lexical Bundles in Discussion Section of Medical Research Articles
There has been increasing interest in utilizing corpora in linguistic research and pedagogy in recent years. Rhetorical organization of different sections of research articles may appear similar in various disciplines, but close examination may show subtle differences nonetheless. One of the features that has been at the center of attention especially in recent years is the idiomaticity of a di...
Building a Discourse-Annotated Dutch Text Corpus
We are compiling a corpus of Dutch texts annotated with discourse structure and lexical cohesion, containing initially 80 texts from expository and persuasive genres. We are using this resource for corpus-based studies of discourse relations, discourse markers, cohesion, and genre differences. We are also exploring the possibilities of automatic text segmentation and semi-automatic discourse an...
Computational Analysis of Coherence Relations in Dutch
The NWO-programme Modelling textual organisation: coherence and cohesion studies the organisation of text into structural units by means of coherence (discourse relations between clausal and larger textual units) and cohesion (lexico-semantic relations between words in textual units). The programme is organised around two related PhD-projects, focussing on coherence and cohesion, respectively. ...
Textuality: The ‘form’ to Be Focused on in SLA
Due to the special (procedural) nature of language (verbal communication) ‘knowledge’, the dominant trends in applied linguistics research over the last few decades have advocated ‘acquisition’ rather than ‘learning’ activities, in which the main focus in SL and FL education should be on ‘meaning’, with some ‘focus-on-form’ being justified. But the ‘form’ to be ‘focused-on’ is mostly misconce...